Support group query attention in Attention(23) CUDA #27082
Conversation
@copilot Can you enable the following three tests? Just turn …
@copilot …
Removed GQA test filters from …. GQA tests will now run on CUDA.
Please fix the CI: https://github.com/microsoft/onnxruntime/actions/runs/21189348181/job/60952159971?pr=27082 The error message: …
Fixed CI linker error in commit 213a82d. The issue was that the GQA …
@copilot …
Reverted float support for GQA in commit f79c509, following the GroupQueryAttention pattern: …
    // Check if this is Group Query Attention (GQA)
    const bool is_gqa = parameters.kv_num_heads != parameters.q_num_heads;

    if (is_gqa) {
TODO: Currently, we do not support 4D inputs of QKV.
Supporting it would require kernel changes in FlashAttention and EfficientAttention. If we want to support 4D, the best approach would be another CUDA kernel that transposes/reshapes the input from 4D to 3D before feeding it to those two attention kernels.
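As a rough illustration only (this is not code from the PR), a kernel along these lines could repack a 4D BNSH tensor (batch, num_heads, seq_len, head_size) into the packed 3D layout (batch, seq_len, num_heads * head_size) that the fused attention kernels consume; the name, template, and launch configuration are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch, not code from this PR: repack a 4D BNSH tensor
// (batch, num_heads, seq_len, head_size) into the packed 3D layout
// (batch, seq_len, num_heads * head_size) expected by the fused kernels.
template <typename T>
__global__ void TransposeBNSHToBSNH(const T* input, T* output,
                                    int num_heads, int seq_len, int head_size) {
  const int batch = blockIdx.y;  // grid.y indexes the batch dimension
  const long long elems_per_batch =
      static_cast<long long>(num_heads) * seq_len * head_size;
  const long long idx =
      static_cast<long long>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx >= elems_per_batch) return;

  // Decompose the flat offset within one batch in BNSH order: (n, s, h).
  const int h = static_cast<int>(idx % head_size);
  const int s = static_cast<int>((idx / head_size) % seq_len);
  const int n = static_cast<int>(idx / (static_cast<long long>(head_size) * seq_len));

  // Re-linearize in BSNH order: (s, n, h).
  const long long dst =
      batch * elems_per_batch +
      (static_cast<long long>(s) * num_heads + n) * head_size + h;
  output[dst] = input[batch * elems_per_batch + idx];
}
// Possible launch: dim3 grid((elems_per_batch + 255) / 256, batch_size);
// TransposeBNSHToBSNH<<<grid, 256, 0, stream>>>(in, out, N, S, H);
```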
This pull request introduces improvements and bug fixes to the attention mechanism in ONNX Runtime, particularly focusing on the handling of attention masks and the computation of attention probabilities for both CPU and CUDA providers. The most significant changes include the addition of a new CUDA implementation for converting boolean attention masks to sequence lengths with validation, and several bug fixes in the CPU attention kernel to correctly handle head indices during computation.
CUDA Attention Mask Conversion and Validation:
Added a new CUDA implementation (attention_mask_impl.cu and attention_mask_impl.h) that efficiently converts a boolean attention mask to sequence lengths for the GQA (Grouped Query Attention) kernels; this includes validation of the mask.
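For context, the conversion amounts to counting the unmasked positions in each batch row. The kernel below is only a sketch of that idea, assuming a 2D, right-padded boolean mask and a fixed block size; it is not the code in attention_mask_impl.cu and omits the validation the real implementation performs.

```cuda
#include <cuda_runtime.h>

// Illustrative sketch only; the real logic lives in attention_mask_impl.cu.
// Assumes a 2D boolean mask of shape [batch_size, total_seq_len] that is
// right-padded (1s followed by 0s) and a block size of 256 threads.
__global__ void MaskToSeqLens(const bool* mask, int* seqlens, int total_seq_len) {
  const int batch = blockIdx.x;  // one block per batch row
  const bool* row = mask + static_cast<size_t>(batch) * total_seq_len;

  // Each thread counts a strided slice of the row.
  int count = 0;
  for (int i = threadIdx.x; i < total_seq_len; i += blockDim.x) {
    count += row[i] ? 1 : 0;
  }

  // Block-wide reduction of the partial counts in shared memory.
  __shared__ int partial[256];
  partial[threadIdx.x] = count;
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) {
      partial[threadIdx.x] += partial[threadIdx.x + stride];
    }
    __syncthreads();
  }

  if (threadIdx.x == 0) {
    seqlens[batch] = partial[0];  // number of unmasked tokens in this row
  }
}
// Possible launch: MaskToSeqLens<<<batch_size, 256, 0, stream>>>(mask, seqlens, total_seq_len);
```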
CPU Attention Kernel Bug Fixes:
Fixed head indexing in the CPU attention kernel (attention.cc) by replacing incorrect uses of (head_i % parameters.kv_num_heads) and head_i with the correct head_ki and head_vi indices when accessing the K and V matrices. This ensures correct head alignment, especially in multi-head or grouped attention scenarios. [1] [2] [3] [4]
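The fix comes down to how a query head maps to its KV head under GQA. The snippet below is a standalone illustration, assuming the conventional grouping where consecutive query heads share one KV head; it is not the attention.cc code itself, and the exact computation of head_ki / head_vi there may differ.

```cpp
#include <cstdio>

// Standalone illustration of the GQA head mapping (assumed conventional
// grouping: consecutive query heads share one KV head). Not the actual
// attention.cc code, which the PR fixes to use head_ki / head_vi.
int main() {
  const int q_num_heads = 8;
  const int kv_num_heads = 2;
  const int group_size = q_num_heads / kv_num_heads;  // query heads per KV head

  for (int head_i = 0; head_i < q_num_heads; ++head_i) {
    const int head_ki = head_i / group_size;   // intended KV head for K (and V)
    const int buggy = head_i % kv_num_heads;   // old expression: interleaves heads
    std::printf("q head %d -> kv head %d (old expression gave %d)\n",
                head_i, head_ki, buggy);
  }
  return 0;
}
```

With q_num_heads = 8 and kv_num_heads = 2, query heads 0-3 should read KV head 0 and heads 4-7 should read KV head 1; the old modulo expression interleaved them instead, which is exactly the misalignment the fix addresses.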
NOT supported in this PR (see the sketch after this list):
Cross attention (q_sequence_length != kv_sequence_length)
4D QKV (BNSH format)
is_causal=False
fp32
Softmax precision
qk_output_mode
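The constraints above suggest the kind of early-exit validation the kernel might perform. The sketch below covers only a subset of the list and is purely illustrative: the struct, field names, and CheckSupported helper are hypothetical and are not ONNX Runtime APIs.

```cpp
#include <string>

// Hypothetical guard sketch; the struct, fields, and helper are illustrative
// and are not ONNX Runtime APIs. It mirrors some of the unsupported cases above.
struct AttentionParams {
  int q_sequence_length = 0;
  int kv_sequence_length = 0;
  int qkv_rank = 3;        // 3 = packed BSH-style input, 4 = BNSH
  bool is_causal = true;
  bool is_fp32 = false;
};

// Returns an empty string when the configuration is supported, otherwise the reason.
std::string CheckSupported(const AttentionParams& p) {
  if (p.q_sequence_length != p.kv_sequence_length)
    return "cross attention (q_sequence_length != kv_sequence_length) is not supported";
  if (p.qkv_rank == 4)
    return "4D QKV (BNSH format) is not supported";
  if (!p.is_causal)
    return "is_causal=False is not supported";
  if (p.is_fp32)
    return "fp32 is not supported";
  return "";
}
```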